17 research outputs found

A Spanish text corpus for the author profiling task

Author: Cagnina Leticia
Errecalde Marcelo Luis
Garciarena Ucelay María José
Villegas María Paula
Publication date: 01/10/2014

Author profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, or native language. This task is of growing importance due to its potential applications in security, crime detection, and marketing, among others. One of the main difficulties in this field is the lack of reliable text collections (corpora) for training and testing automatically derived classifiers, particularly for specific languages such as Spanish. Although some data sets were recently generated for the PAN competitions, those documents contain a lot of “noise” that prevents researchers from drawing more general conclusions about the task when more formal documents are used. In this context, this work proposes and describes SpanText, a collection of formal texts in Spanish which is, as far as we know, the first collection with these characteristics for the author profiling task. In addition, an experimental study clearly establishes the difference in performance obtained with formal and informal texts, and opens interesting research lines toward a deeper understanding of the particular challenges that each type of document poses for author profiling.

XI Workshop Bases de Datos y Minería de Datos. Red de Universidades con Carreras de Informática (RedUNCI)

A Spanish text corpus for the author profiling task

Author: Cagnina Leticia
Errecalde Marcelo Luis
Garciarena Ucelay María José
Villegas María Paula
Publication date: 04/11/2014

k-TVT: a flexible and effective method for early depression detection

Author: Cagnina Leticia
Errecalde Marcelo Luis
Funez Dario G.
Garciarena Ucelay María José
Villegas María Paula
Publication date: 01/10/2019

The increasing use of social media allows valuable information to be extracted in order to prevent some risks early. Such is the case with the use of blogs to detect people with signs of depression early. To address this problem, we describe k-temporal variation of terms (k-TVT), a method that uses the variation of vocabulary across time steps as the concept space for representing documents. An interesting feature of this approach is that a parameter (the k value) can be set according to the urgency (earliness) level required to detect the at-risk (depressed) cases. Results on the eRisk 2017 early depression detection data set seem to confirm the robustness of k-TVT for different urgency levels using an SVM as the classifier. In addition, some recent results on an extension of this collection would confirm the effectiveness of k-TVT as one of the state-of-the-art methods for early depression detection.

XVI Workshop Bases de Datos y Minería de Datos. Red de Universidades con Carreras en Informática

Servicio de Difusión de la Creación Intelectual

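Analysis of linguistic features with natural language processing techniques in the early detection of depression (original title: Análisis de rasgos lingüísticos con técnicas de procesamiento del lenguaje natural en la detección temprana de depresión)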

Author: Cagnina Leticia Cecilia
Errecalde Marcelo Luis
Garciarena Ucelay María José
Publication venue: Anales de Lingüística
Publication date: 21/12/2021

The development of computational methods that use information from the Web for early risk detection is a socially relevant, scientifically attractive, and currently growing area of research. Depression is one of the most frequent mental disorders in the world, with a high incidence of suicide in the most severe cases. Early detection of this illness could therefore lead to timely treatment and even save lives. This paper analyzes the relationship between computational models that allow the automatic detection of depression and the linguistic properties of text written by people who experience the disease. State-of-the-art text representations for document classification are used, covering linguistic, syntactic, and semantic aspects. The results obtained with standard classifiers indicate that word embeddings capture precise information for detecting signs of depression quickly and safely.

Centro Universitario Mendoza, Facultad de Filosofía y Letras: Open Journal Systems FFYL

An experimental study for the Cross Domain Author Profiling classification

Author: Cagnina Leticia
Errecalde Marcelo Luis
Garciarena Ucelay María José
Villegas María Paula
Publication date: 28/12/2015

Author profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, or native language. This task is of growing importance due to its potential applications in security, crime detection, and marketing, among others. An interesting question is how robust a classifier is when it is trained on one dataset and tested on others with different characteristics; this is commonly called cross-domain experimentation. Although several cross-domain studies have been carried out for English datasets, for Spanish such work has only recently begun. In this context, this work presents a study of cross-domain classification for the author profiling task in Spanish. The experimental results showed that, using corpora with different levels of formality, robust classifiers can be obtained for the author profiling task in Spanish.

XII Workshop Bases de Datos y Minería de Datos (WBDDM). Red de Universidades con Carreras en Informática (RedUNCI)

Servicio de Difusión de la Creación Intelectual
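
The cross-domain protocol described in the abstract above (train on one corpus, test on another with different characteristics) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy texts, the "adult"/"teen" labels, and the simple token-overlap classifier are hypothetical and are not the corpora, classes, or classifiers used in the paper.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train(labelled_docs):
    """Accumulate per-class token counts from (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in labelled_docs:
        counts[label].update(tokenize(text))
    return counts


def predict(model, text):
    """Pick the class whose training token counts overlap most with the text."""
    tokens = tokenize(text)
    scores = {label: sum(c[tok] for tok in tokens) for label, c in model.items()}
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


def accuracy(model, labelled_docs):
    """Fraction of documents whose predicted label matches the given label."""
    hits = sum(predict(model, text) == label for text, label in labelled_docs)
    return hits / len(labelled_docs)


# Hypothetical toy corpora standing in for a formal and an informal domain.
formal = [
    ("the report presents results", "adult"),
    ("the committee approved the budget", "adult"),
    ("my homework is boring", "teen"),
    ("school exam tomorrow", "teen"),
]
informal = [
    ("lol school is boring", "teen"),
    ("budget meeting ran late", "adult"),
]

model = train(formal)                    # train on the source domain only
in_domain = accuracy(model, formal)      # evaluated on the training domain
cross_domain = accuracy(model, informal) # evaluated on the unseen domain
```

Comparing `in_domain` with `cross_domain` is the core of the protocol: a large gap would suggest that the classifier depends on features specific to the training corpus.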

On the Importance of Data Representation for the Success of Text Classification

Author: Cagnina Leticia
Cuello Carolina Y.
Garciarena Ucelay María José
Jofre Caradonna Vanessa
Publication date: 02/03/2023

Text mining approaches use natural language processing to automatically extract patterns from texts. Tasks such as topic labeling, news classification, question answering, named entity recognition, and sentiment analysis usually require elaborate and effective document representations. In this context, word representation models in general, and vector-based word representations in particular, have gained increasing interest as a way to alleviate some of the limitations of Bag of Words. In this article, we analyze the use of several vector-based word representations, in addition to the classical ones, in a polarity analysis task on movie reviews. Experimental results show the effectiveness of more elaborate representations compared with Bag of Words. In particular, the Concise Semantic Analysis representation seems very robust and effective: the results are good regardless of the classifier used. The dimensionality of the representations and the time needed to compute them are also reported, supporting the efficiency of the classifiers when Concise Semantic Analysis is used.

XIX Workshop Base de Datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática

Servicio de Difusión de la Creación Intelectual
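
For readers unfamiliar with the Bag of Words baseline mentioned in the abstract above, the following is a minimal sketch of a term-count representation. The tokenizer, the function names, and the two toy reviews are assumptions made for illustration; they do not come from the paper or its data set.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a term-count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy reviews, not taken from any real collection.
reviews = ["A great, great movie.", "A dull movie!"]
vocab = build_vocabulary(reviews)
vectors = [bag_of_words(r, vocab) for r in reviews]
# vocab   -> {'a': 0, 'dull': 1, 'great': 2, 'movie': 3}
# vectors -> [[1, 0, 2, 1], [1, 1, 0, 1]]
```

Each document becomes a vector with one dimension per vocabulary term, so word order and relations between words are discarded. That is one of the limitations the vector-based word representations in the abstract aim to alleviate.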

New applications of text categorization methods, such as opinion mining and sentiment analysis, author profiling, and plagiarism detection, require more elaborate and effective document representation models than classical Information Retrieval approaches such as the Bag of Words representation. In this context, word representation models in general, and vector-based word representations in particular, have gained increasing interest as a way to overcome or alleviate some of the limitations of Bag of Words-based representations. In this article, we analyze the use of several vector-based word representations in a sentiment analysis task on movie reviews. Experimental results show the effectiveness of some vector-based word representations compared with standard Bag of Words representations. In particular, the Second Order Attributes representation seems very robust and effective: the results are good regardless of the classifier used.

XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI)